Statistic Machine Translation Boosted with Spurious Word Deletion

نویسندگان

  • Shujie Liu
  • Chi-Ho Li
  • Ming Zhou
چکیده

Spurious words usually have no counterpart in other languages, and are therefore a headache in machine translation. In this paper, we propose a novel framework, skeleton-enhanced translation, in which a conventional SMT decoder can boost itself by considering the skeleton of the source input and the translation of such skeleton. By the skeleton of a sentence it is meant the sentence with its spurious words removed. We will introduce two models for identifying spurious words: one is a context-insensitive model, which removes all tokens of certain words; another is a context-sensitive model, which makes separate decision for each word token. We will also elaborate two methods to improve a translation decoder using skeleton translation: one is skeleton-enhanced re-ranking, which re-ranks the n-best output of a conventional SMT decoder with respect to a translated skeleton; another is skeleton-enhanced decoding, which re-ranks the translation hypotheses of not only the entire sentence but any span of the sentence. Our experiments show significant improvement (1.6 BLEU) over the state-of-the-art SMT performance.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

An Empirical Study in Source Word Deletion for Phrase-Based Statistical Machine Translation

The treatment of ‘spurious’ words of source language is an important problem but often ignored in the discussion on phrase-based SMT. This paper explains why it is important and why it is not a trivial problem, and proposes three models to handle spurious source words. Experiments show that any source word deletion model can improve a phrase-based system by at least 1.6 BLEU points and the most...

متن کامل

Better Addressing Word Deletion for Statistical Machine Translation

Word deletion (WD) problems have a critical impact on the adequacy of translation and can lead to poor comprehension of lexical meaning in the translation result. This paper studies how the word deletion problem can be handled in statistical machine translation (SMT) in detail. We classify this problem into desired and undesired word deletion based on spurious and meaningful words. Consequently...

متن کامل

Natural Language Engineering

Most statistical machine translation systems typically rely on word alignments to extract translation rules. This approach would suffer from a practical problem that even one spurious word alignment link can prevent some desirable translation rules from being extracted. To address this issue, this paper presents two approaches, referred to as sub-tree alignment and phrase-based forced decoding ...

متن کامل

Dealing with deletion errors in MT

Content word deletion has widely been identified as a common error in statistical machine translation systems. We refer to a translation error as “content word deletion” when a content word appears in the reference translation of some sentence, but is absent from the system output translation of that sentence. For our purposes, the term “content word” does not have a precise definition, but rep...

متن کامل

Context Sensitive Word Deletion Model for Statistical Machine Translation

Word deletion (WD) errors can lead to poor comprehension of the meaning of source translated sentences in phrase-based statistical machine translation (SMT), and have a critical impact on the adequacy of the translation results generated by SMT systems. In this paper, first we classify the word deletion into two categories, wanted and unwanted word deletions. For these two kinds of word deletio...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2011